{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we walk through the design of target encoding. We start with a motivating example, `criteo dataset`, to show why target encoding is preferred over one hot encoding and label encoding. The concepts and optimizations of target encoding are introduced step by step. The key takeaway is that target encoding differs from traditional sklearn style encoders in the following aspects:\n",
    "\n",
    "- The ground truth column `target` is used as input for encoding.\n",
    "- The training data and test data are transformed differently.\n",
    "- Multi-column joint transformation is supported by target encoding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Table of contents\n",
    "[1. Motivation](#motivation)<br>\n",
    "> [Criteo data](#criteo)<br>\n",
    "[Why not one-hot encoding?](#onehot)<br>\n",
    "[Label encoding](#lbl)<br>\n",
    "[Train XGB with label encoding ](#lblxgb)<br>\n",
    "\n",
    "[2. Target Encoding](#tar)<br>\n",
    "> [A naive implementation](#naive)<br>\n",
    "[A K-fold cross validate implementation](#kfold)<br>\n",
    "[An optimized implementation](#opt)<br>\n",
    "[Multi-column joint encoding](#multi)<br>\n",
    "\n",
    "[3. Conclusions](#conclusions)<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "GPU_id = '0,1,2,3'\n",
    "os.environ['CUDA_VISIBLE_DEVICES'] = GPU_id\n",
    "num_gpus = len(GPU_id.split(','))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "import numpy as np\n",
    "import cudf as gd\n",
    "import cupy as cp\n",
    "from cuml.preprocessing.LabelEncoder import LabelEncoder\n",
    "from cuml.preprocessing.TargetEncoder import TargetEncoder\n",
    "import dask as dask, dask_cudf\n",
    "from dask.distributed import Client, wait\n",
    "from dask_cuda import LocalCUDACluster\n",
    "import xgboost as xgb\n",
    "import matplotlib.pyplot as plt\n",
    "import time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"motivation\"></a>\n",
    "## 1. Motivation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"criteo\"></a>\n",
    "### Criteo data\n",
    "The [criteo 1-TB benchmark](https://github.com/rambler-digital-solutions/criteo-1tb-benchmark) is a well-known dataset for click through rate modeling. We only use three categorical features to make it a simple dataset for the problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "path = '/datasets/criteo/raw_csvs/split_train_data'\n",
    "train_name = f'{path}/day_0_part_0000'\n",
    "valid_name = f'{path}/day_0_part_0001'\n",
    "num_cols = ['num_%d'%i for i in range(13)]\n",
    "cat_cols = ['cat_%d'%i for i in range(26)]\n",
    "cols = ['label']+num_cols+cat_cols\n",
    "dtypes = {i:'str' if i.startswith('cat_') else 'float32' for i in cols}\n",
    "train = gd.read_csv(train_name, sep = '\\t', header=None, names=cols, dtypes=dtypes)\n",
    "valid = gd.read_csv(valid_name, sep = '\\t', header=None, names=cols, dtypes=dtypes)\n",
    "\n",
    "used_cols = ['label']+cat_cols[:3]\n",
    "\n",
    "train = train[used_cols]\n",
    "valid = valid[used_cols]\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The categorical columns are strings originally so we need some kind of encoding to turn them into numerical columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"onehot\"></a>\n",
    "### Why not one-hot encoding?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in cat_cols[:3]:\n",
    "    print(col,'cardinality',len(train[col].unique()), len(valid[col].unique()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With such high cardinality, it is inefficient to do one-hot encoding because it leads to either huge memory consumption or very sparse data, which is less optimized when running on GPU.\n",
    "\n",
    "Therefore, we use label encoding to transform such string columns to numerical columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"lbl\"></a>\n",
    "### Label encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "for col in cat_cols[:3]:\n",
    "    train[col] = train[col].fillna('None')\n",
    "    valid[col] = valid[col].fillna('None')\n",
    "    lbl = LabelEncoder()\n",
    "    lbl.fit(gd.concat([train[col],valid[col]]))\n",
    "    train[col] = lbl.transform(train[col])\n",
    "    valid[col] = lbl.transform(valid[col])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Label encoding transforms string columns to integer columns. However, the mapping from a string to an integer is arbitrary, which makes the encoded features less informative. For example, the first three rows of `cat_2` are `9218`, `5875` and `5199`. Although `5875` is closer to `5199` than `9218`, there is absolutely no guarantee that the string of `5875` is more similar to string of `5199` than string of `9218`. In other words, a tree classifier has make many splits to learn the pattern buried within such encoded features.   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"lblxgb\"></a>\n",
    "### Train XGB with label encoding features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_parms = { \n",
    "    'max_depth':6, \n",
    "    'learning_rate':0.1, \n",
    "    'subsample':0.8,\n",
    "    'colsample_bytree':1.0, \n",
    "    'eval_metric':'auc',\n",
    "    'objective':'binary:logistic',\n",
    "    'tree_method':'gpu_hist',\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NROUND = 100\n",
    "VERBOSE_EVAL = 10\n",
    "ESR = 10\n",
    "\n",
    "start = time.time(); print('Creating DMatrix...')\n",
    "dtrain = xgb.DMatrix(data=train.drop('label',axis=1),label=train['label'])\n",
    "dvalid = xgb.DMatrix(data=valid.drop('label',axis=1),label=valid['label'])\n",
    "print('Took %.1f seconds'%(time.time()-start))\n",
    "\n",
    "start = time.time(); print('Training...')\n",
    "model = xgb.train(xgb_parms, \n",
    "                       dtrain=dtrain,\n",
    "                       evals=[(dtrain,'train'),(dvalid,'valid')],\n",
    "                       num_boost_round=NROUND,\n",
    "                       early_stopping_rounds=ESR,\n",
    "                       verbose_eval=VERBOSE_EVAL) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown above, using label encoding features results in a valid auc score of 0.60. Let's see if target encoding can improve this score."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"tar\"></a>\n",
    "## 2. Target encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea of target encoding is very simple: we encode the categorical column by the mean value of the `target` of the group associated with each unique value of the categorical column, where `target` is the ground truth column to be predicted. In other words, it is essentially just a simple groupby-aggregation-merge or `groupby-transform`, in pandas terms:<br> `df['fea_encode'] = df.groupby('fea')['target'].transform(lambda x: x.mean())`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"naive\"></a>\n",
    "### A naive implementation\n",
    "Let's implement targe encoding of the idea above and study where we can improve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "for col in cat_cols[:3]:\n",
    "    tmp = train.groupby(col, as_index=False).agg({'label':'mean'})\n",
    "    tmp.columns = [col, f'{col}_TE']\n",
    "    train = train.merge(tmp, on=col, how='left')\n",
    "    valid = valid.merge(tmp, on=col, how='left')\n",
    "    del tmp\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will only use the target encoding features to train XGB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "te_cols = [col for col in train.columns if col.endswith('TE')]\n",
    "print(te_cols)\n",
    "\n",
    "start = time.time(); print('Creating DMatrix...')\n",
    "dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])\n",
    "dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])\n",
    "print('Took %.1f seconds'%(time.time()-start))\n",
    "\n",
    "start = time.time(); print('Training...')\n",
    "model = xgb.train(xgb_parms, \n",
    "                       dtrain=dtrain,\n",
    "                       evals=[(dtrain,'train'),(dvalid,'valid')],\n",
    "                       num_boost_round=NROUND,\n",
    "                       early_stopping_rounds=ESR,\n",
    "                       verbose_eval=VERBOSE_EVAL) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However the valid auc is not improved with the naive target encoding. Furthermore, the bigger discrepancy between `train auc` and `valid auc` is alarming. It means the naive target encoding suffers from an overfitting problem. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = ['Label encoding', 'Target encoding naive']\n",
    "train_auc = [0.65, 0.84]\n",
    "valid_auc = [0.64, 0.63]\n",
    "\n",
    "x = np.arange(len(labels))  # the label locations\n",
    "width = 0.35  # the width of the bars\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "rects1 = ax.bar(x - width/2, train_auc, width, label='train auc', color='m')\n",
    "rects2 = ax.bar(x + width/2, valid_auc, width, label='valid auc', color='c')\n",
    "\n",
    "ax.set_ylabel('Auc')\n",
    "ax.set_title('The overfitting problem of naive target encoding')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(labels)\n",
    "ax.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cause is actually obvious. We use the ground truth column directly in creating the features for the training data, which doesn't generalize to validation data  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"kfold\"></a>\n",
    "### A K-fold cross validate implementation\n",
    "To alleviate such overfitting, we can encode the traning data in k-folds, so that a sample's ground truth is not touched when creating its target encoding feature. The procedure is shown in the animation below.<br>\n",
    "![ChessUrl](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F100236%2F64cc45bbe25144503bc93cf4b9e102f1%2Fmte.gif?generation=1594620515929361&alt=media \"chess\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop the naive TE columns\n",
    "train = train.drop(te_cols, axis=1)\n",
    "valid = valid.drop(te_cols, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "FOLDS = 10\n",
    "train['fold'] = cp.arange(len(train))%FOLDS\n",
    "train['row_id'] = cp.arange(len(train))\n",
    "mean = train['label'].mean()\n",
    "for col in cat_cols[:3]:\n",
    "    res = []\n",
    "    out_col = f'{col}_TE'\n",
    "    for i in range(FOLDS):\n",
    "        tmp = train[train['fold']!=i].groupby(col, as_index=False).agg({'label':'mean'})\n",
    "        tmp.columns = [col, out_col]\n",
    "        tr = train[train['fold']==i][['row_id',col]]\n",
    "        tr = tr.merge(tmp,on=col,how='left')\n",
    "        res.append(tr)\n",
    "        del tmp\n",
    "    res = gd.concat(res)\n",
    "    res = res.sort_values('row_id')\n",
    "    train[out_col] = res[out_col].fillna(mean).values\n",
    "    del res\n",
    "    tmp = train.groupby(col, as_index=False).agg({'label':'mean'})\n",
    "    tmp.columns = [col, out_col]\n",
    "    valid = valid.merge(tmp, on=col, how='left')\n",
    "    del tmp\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A key observation here is that training data and test/validation data are encoded differently. The training data is encoded using this *fancy kfold cross validated* fashion while test data is encoded just using *group mean*. Comparing to `LabelEncoder`, the implication is that with `TargetEncoder`we can't use the exactly same api `transform` for both `training data` and `test data`.\n",
    "\n",
    "```\n",
    "# Using transform for both data works\n",
    "lbl = LabelEncoder()\n",
    "lbl.fit(gd.concat([train[col],valid[col]]))\n",
    "train[col] = lbl.transform(train[col])\n",
    "valid[col] = lbl.transform(valid[col])\n",
    "\n",
    "# Using transform for both data doesn't work\n",
    "tar = TargetEncoder()\n",
    "tar.fit(train[col], train['target'])\n",
    "train[col] = tar.transform(train[col]) \n",
    "valid[col] = tar.transform(valid[col])\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "te_cols = [col for col in train.columns if col.endswith('TE')]\n",
    "print(te_cols)\n",
    "\n",
    "start = time.time(); print('Creating DMatrix...')\n",
    "dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])\n",
    "dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])\n",
    "print('Took %.1f seconds'%(time.time()-start))\n",
    "\n",
    "start = time.time(); print('Training...')\n",
    "model = xgb.train(xgb_parms, \n",
    "                       dtrain=dtrain,\n",
    "                       evals=[(dtrain,'train'),(dvalid,'valid')],\n",
    "                       num_boost_round=NROUND,\n",
    "                       early_stopping_rounds=ESR,\n",
    "                       verbose_eval=VERBOSE_EVAL) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = ['Label encoding', 'Target encoding naive', 'Target encoding kfold for loop']\n",
    "train_auc = [0.65, 0.84, 0.71]\n",
    "valid_auc = [0.64, 0.63, 0.7]\n",
    "\n",
    "x = np.arange(len(labels))  # the label locations\n",
    "width = 0.35  # the width of the bars\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "fig.set_figwidth(15)\n",
    "rects1 = ax.bar(x - width/2, train_auc, width, label='train auc', color='m')\n",
    "rects2 = ax.bar(x + width/2, valid_auc, width, label='valid auc', color='c')\n",
    "\n",
    "ax.set_ylabel('Auc')\n",
    "ax.set_title('The overfitting problem is fixed by kfold target encoding')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(labels)\n",
    "ax.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"opt\"></a>\n",
    "### An optimized implementation\n",
    "We can make further improvements:\n",
    "- calculate the encoding in one shot instead of the for loop.\n",
    "- encode one column or many columns jointly\n",
    "- smooth the encoding so that it is not skewed by infrequent values.\n",
    "- support both single and multi gpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop the previous TE columns\n",
    "train = train.drop(te_cols, axis=1)\n",
    "valid = valid.drop(te_cols, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the optimized implementation is about 6x faster than the previous `for loop` based implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "SMOOTH = 0.001\n",
    "SPLIT = 'interleaved'\n",
    "for col in cat_cols[:3]:\n",
    "    out_col = f'{col}_TE'\n",
    "    encoder = TargetEncoder(n_folds=FOLDS, smooth=SMOOTH, split_method=SPLIT)\n",
    "    #train[out_col] = encoder.fit_transform(train[col], train['label'])\n",
    "    encoder.fit(train[col], train['label'])\n",
    "    train[out_col] = encoder.transform(train[col])\n",
    "    valid[out_col] = encoder.transform(valid[col])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "te_cols = [col for col in train.columns if col.endswith('TE')]\n",
    "print(te_cols)\n",
    "\n",
    "start = time.time(); print('Creating DMatrix...')\n",
    "dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])\n",
    "dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])\n",
    "print('Took %.1f seconds'%(time.time()-start))\n",
    "\n",
    "start = time.time(); print('Training...')\n",
    "model = xgb.train(xgb_parms, \n",
    "                       dtrain=dtrain,\n",
    "                       evals=[(dtrain,'train'),(dvalid,'valid')],\n",
    "                       num_boost_round=NROUND,\n",
    "                       early_stopping_rounds=ESR,\n",
    "                       verbose_eval=VERBOSE_EVAL) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The optimized version is slightly more accurate and it could be up to 10x faster than the `kfold for loop` implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = ['Label encoding', 'Target encoding naive', 'Target encoding kfold for loop', 'Target encoding optimized']\n",
    "train_auc = [0.65, 0.84, 0.71, 0.72]\n",
    "valid_auc = [0.64, 0.63, 0.7, 0.704]\n",
    "\n",
    "x = np.arange(len(labels))  # the label locations\n",
    "width = 0.35  # the width of the bars\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "fig.set_figwidth(15)\n",
    "rects1 = ax.bar(x - width/2, train_auc, width, label='train auc', color='m')\n",
    "rects2 = ax.bar(x + width/2, valid_auc, width, label='valid auc', color='c')\n",
    "\n",
    "ax.set_ylabel('Auc')\n",
    "ax.set_title('The overfitting problem is fixed by kfold target encoding')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(labels)\n",
    "ax.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"multi\"></a>\n",
    "### Multi-column joint encoding\n",
    "Instead of encoding one column at a time, we can also encoding multiple columns jointly into one new feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "for cols in [['cat_0', 'cat_1'],\n",
    "             ['cat_0', 'cat_2'],\n",
    "             ['cat_1', 'cat_2'],\n",
    "             ['cat_0', 'cat_1', 'cat_2']\n",
    "            ]:\n",
    "    out_col = '_'.join(cols)+'_TE'\n",
    "    encoder = TargetEncoder(n_folds=FOLDS,smooth=SMOOTH, split_method=SPLIT)\n",
    "    train[out_col] = encoder.fit_transform(train[cols], train['label'])\n",
    "    valid[out_col] = encoder.transform(valid[cols])\n",
    "    del encoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "te_cols = [col for col in train.columns if col.endswith('TE')]\n",
    "print(te_cols)\n",
    "\n",
    "start = time.time(); print('Creating DMatrix...')\n",
    "dtrain = xgb.DMatrix(data=train[te_cols],label=train['label'])\n",
    "dvalid = xgb.DMatrix(data=valid[te_cols],label=valid['label'])\n",
    "print('Took %.1f seconds'%(time.time()-start))\n",
    "\n",
    "start = time.time(); print('Training...')\n",
    "model = xgb.train(xgb_parms, \n",
    "                       dtrain=dtrain,\n",
    "                       evals=[(dtrain,'train'),(dvalid,'valid')],\n",
    "                       num_boost_round=NROUND,\n",
    "                       early_stopping_rounds=ESR,\n",
    "                       verbose_eval=VERBOSE_EVAL) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although the validation AUC doesn't improve much for this dataset, the functionality of multi-column joint encoding is necessary and might improve the prediction for other datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we explains the key design choices of target encoding. The takeaways are:\n",
    "- The ground truth column `target` is used as input for encoding.\n",
    "- The training data and test data are transformed differently.\n",
    "- Multi-column joint transformation is supported by target encoding."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
